A Bayesian Approach to Modeling Home Run Production in Major League Baseball
In this research project, Dr. Parson and I sought to predict the home run (HR) production of Major League Baseball hitters.
If you work for an MLB team, predicting HR’s is important because a HR is the pinnacle outcome of an at-bat (AB).
And even if you don’t work for an MLB team, you could make a lot of money in Vegas if you can accurately predict HR’s!
A problem with predicting player HR’s is that there is often limited data available.
Generalized linear models (GLM’s) from classical statistics have a hard time fitting accurate models given limited data points, such as the 6 data point scenario mentioned above.
This is where a Bayesian model shines: Bayesian models can “learn” from the data, improving the effective number of observations they have for fitting.
Pros: Bayesian model
Cons:
Sparingly uses multilevel modeling
Priors are uninformative and don’t fit data
Expect average player’s HR probability to be 0.00015
Priors effectively suggest that players could have a HR probability between 0 and 1
Strange choices of parameters
\[ \begin{align} HR_{nip} &\sim Binomial(AB_{nip},\pi_{nip})\\ \log\left(\frac{\pi_{nip}}{1-\pi_{nip}}\right) &= \alpha_{n}+\beta_n\cdot (Age_{ni}-30)+\eta_n\cdot (Age_{ni}-30)^2\\ &+\delta_p+\xi_i \end{align} \]
The number of HR’s hit by player \(n\) in year \(i\) at park \(p\) is binomially distributed, according to player \(n\)’s AB’s and HR probability \(\pi\) for year \(i\) at park \(p\).
The number of AB’s in a year is taken as given for our prediction of HR’s, but the probability \(\pi\) that an AB results in a HR varies according to a number of factors.
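As a rough illustration of how those factors combine into \(\pi\), the sketch below evaluates the linear predictor and converts it to a probability. It assumes a logit link (consistent with the "logit scale" used for the priors); the function names and all parameter values are hypothetical and chosen only for illustration, not fitted estimates.

```python
import math


def inv_logit(x: float) -> float:
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def hr_probability(alpha: float, beta: float, eta: float,
                   delta: float, xi: float, age: float) -> float:
    """HR probability per AB from the linear predictor, assuming a logit link."""
    centered_age = age - 30
    log_odds = (alpha + beta * centered_age + eta * centered_age ** 2
                + delta + xi)
    return inv_logit(log_odds)


# Hypothetical values: an average intercept, no linear age effect, a small
# negative quadratic age effect, and neutral park and year effects.
pi_30 = hr_probability(alpha=-3.5, beta=0.0, eta=-0.005, delta=0.0, xi=0.0, age=30)
pi_38 = hr_probability(alpha=-3.5, beta=0.0, eta=-0.005, delta=0.0, xi=0.0, age=38)
print(f"pi at age 30: {pi_30:.4f}, pi at age 38: {pi_38:.4f}")
```

With a negative quadratic coefficient, the probability peaks at the centering age of 30 and declines on either side.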
We can both simulate and predict the number of HR’s of a player. For example, let’s examine a player that has 100 AB’s and a probability \(\pi\) of 0.03.
That is, \(HR \sim Binomial(100,0.03)\),
Simulation:
Prediction:
\(E(HR)=AB\cdot \pi=100\cdot0.03=3\)
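The simulation and prediction for this example player can be sketched in a few lines. This is a minimal standard-library version that treats each AB as an independent Bernoulli trial; the helper name `simulate_hr` and the seed are arbitrary choices for illustration.

```python
import random


def simulate_hr(at_bats: int, hr_prob: float, rng: random.Random) -> int:
    """Draw one HR total as a sum of independent Bernoulli at-bats."""
    return sum(1 for _ in range(at_bats) if rng.random() < hr_prob)


rng = random.Random(2019)  # arbitrary seed so the draws are repeatable
at_bats, hr_prob = 100, 0.03

# Simulation: five random draws from Binomial(100, 0.03).
simulated = [simulate_hr(at_bats, hr_prob, rng) for _ in range(5)]

# Prediction: the expected value AB * pi.
expected = at_bats * hr_prob

print("Simulated HR totals:", simulated)
print("Expected HR total:", expected)
```

The simulated totals scatter around the expected value, which is the single-number prediction.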
Our data for the model comes from the Lahman data set. Lahman covers a variety of information on each player, including but not limited to batting, fielding, team, and player statistics.
To be considered in our analysis, a player must have accumulated at least 6 seasons of play between 1973 and 2019, with at least 50 AB’s in each.
\[ \begin{align} \alpha_n&\sim Normal(\mu_0, \sigma_0), n\in\{1,...,657\}\\ \\ \mu_0&\sim Normal(-3.5,0.1)\\ \sigma_0&\sim Exponential(1) \end{align} \]
Intercept term which represents “innate” hitting ability of players skilled enough to play in MLB
-3.5 on the logit scale is ≈0.029 or 2.9%
-3.5±0.1 = [-3.6, -3.4] = [0.0266, 0.0323] = [2.7%, 3.2%]
-3.5±2 = [-5.5, -1.5] = [0.0041, 0.182] = [0.4%, 18.2%]
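The conversions from the logit scale to probabilities above can be checked with the inverse logit function, \(1/(1+e^{-x})\). A minimal sketch (the function name `inv_logit` is just a label used here):

```python
import math


def inv_logit(x: float) -> float:
    """Convert a value on the logit (log-odds) scale to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


center = -3.5
print(f"center: {inv_logit(center):.4f}")

# Intervals quoted in the text: center +/- 0.1 and center +/- 2.
for half_width in (0.1, 2.0):
    low = inv_logit(center - half_width)
    high = inv_logit(center + half_width)
    print(f"+/-{half_width}: [{low:.4f}, {high:.4f}]")
```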
No pooling would mean that each player, \(n\), gets their own \(\alpha\) estimate. That is,
\[ \log\left(\frac{\pi_n}{1-\pi_n}\right)= \alpha_{PLAYER[n]}+... \]
This intercept would just match the data we had on each player. It probably over-fits because it relies too much on each player’s own data, which is sparse. You can think of this model as suffering from amnesia because it assumes there is nothing in common from player to player.
In a total pooling scenario players share the same intercept \(\alpha\), which corresponds to the overall mean HR probability \(\overline{\pi}\). There are lots of data, so we can be confident of the value of \(\alpha\); however, it wouldn’t be effective to apply on a case-by-case basis. This is because few players are likely to have \(\alpha\)’s which exactly match the mean of the data.
\[ \log\left(\frac{\pi_n}{1-\pi_n}\right)=\alpha+...=\log\left(\frac{\overline{\pi}}{1-\overline{\pi}}\right)+... \]
You can think of this model as suffering from over-sharing since it cannot distinguish the \(\alpha\) from player to player.
In essence, we lose information from both no and total pooling.
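To make the contrast concrete, the sketch below compares the two extremes with a simple middle ground on made-up numbers. The player totals are hypothetical (not from the Lahman data), and the partial pooling step is a pseudo-count shrinkage stand-in with an assumed strength `prior_ab`, not the hierarchical model itself, where the amount of shrinkage is learned from the data through \(\sigma_0\).

```python
# Hypothetical (AB, HR) totals for three players; not from the Lahman data.
players = {"A": (50, 0), "B": (300, 12), "C": (600, 30)}

total_ab = sum(ab for ab, _ in players.values())
total_hr = sum(hr for _, hr in players.values())

# Total pooling: one shared HR rate for everyone.
pooled_rate = total_hr / total_ab

# No pooling: each player's rate uses only that player's own data.
no_pool = {name: hr / ab for name, (ab, hr) in players.items()}

# Partial pooling stand-in: add `prior_ab` pseudo at-bats at the pooled rate,
# so players with few AB's are pulled more strongly toward the group.
prior_ab = 200  # assumed shrinkage strength, for illustration only
partial_pool = {
    name: (hr + prior_ab * pooled_rate) / (ab + prior_ab)
    for name, (ab, hr) in players.items()
}

for name in players:
    print(f"{name}: no pooling {no_pool[name]:.4f}, "
          f"partial {partial_pool[name]:.4f}, total {pooled_rate:.4f}")
```

Player A, with only 50 AB’s and no HR’s, moves furthest from the raw rate of 0, while player C’s estimate barely changes.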
\[ \begin{align} \beta_n &\sim Normal(\mu_1,\sigma_1), n\in\{1,...,657\}\\ \\ \mu_1 &\sim Normal(0,0.1)\\ \sigma_1 &\sim Exponential(10) \end{align} \]
Multiplicative effect representing how deviation from centered age (Age - 30) affects HR hitting ability
Age plays a factor in hitting HR’s, but it is likely not very large so we have the priors set near 0 to reflect this
\[ \begin{align} \eta_n &\sim Normal(\mu_2, \sigma_2),n\in\{1,...,657\}\\ \\ \mu_2 &\sim Normal(0, 0.01)\\ \sigma_2 &\sim Exponential(100)\\ \end{align} \]
Multiplicative effect representing how deviation from centered age squared [(Age - 30)²] affects HR hitting ability
Used to capture the non-linearity of the data without risk of over-fitting
\[ \begin{align} \delta_p &\sim Normal(\mu_5,\sigma_5),p\in\{1,...,88\}\\ \\ \mu_5 &\sim Normal(0,0.01)\\ \sigma_5 &\sim Exponential(10) \end{align} \]
Intercept term which captures the effect playing in different parks has on HR probabilities
Parks differ in both dimensions and altitude, both of which affect HR rates
\[ \begin{align} \xi_i &\sim Normal(\mu_6,\sigma_6),i\in\{1,...,47\}\\ \\ \mu_6 &\sim Normal(0,0.25)\\ \sigma_6 &\sim Exponential(10) \end{align} \]
Intercept term which captures the effect playing in different years has on HR probability
Changes can occur because of rules, ownership goals, player goals, etc.
This term captures those changes without asking why there are changes
\[ \begin{align} HR_{nip} &\sim Binomial(AB_{nip},\pi_{nip})\\ \log\left(\frac{\pi_{nip}}{1-\pi_{nip}}\right) &= \alpha_{n}+\beta_n\cdot (Age_{ni}-30)+\eta_n\cdot (Age_{ni}-30)^2\\ &+\delta_p+\xi_i \end{align} \]
There are 2,116 parameters for this model
This model is classically non-identifiable because of 3 intercept terms!
The model predicts that the average player’s \(\pi\) is about 0.03 (on the probability scale), which is what we observe in the data
Uses Bayesian techniques to update estimates based on what the data says - allows for inference
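One way to arrive at the count of 2,116 parameters, assuming each player, park, and year effect is counted once along with the mean and scale hyperparameter of each of the five terms:

\[ \underbrace{3\cdot 657}_{\alpha_n,\ \beta_n,\ \eta_n}+\underbrace{88}_{\delta_p}+\underbrace{47}_{\xi_i}+\underbrace{2\cdot 5}_{\mu,\ \sigma}=1971+88+47+10=2116 \]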
Will our model make the Hood math department excellent gamblers?
Areas of future research
Player archetypes and physical characteristics
Considering more for a longer time interval (like Fellingham and Fisher (2017))
Better data (advanced metrics or play-by-plays)